Document image cleanup and binarization

Authors

  • Victor Wu
  • R. Manmatha
Abstract

Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technologies do not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass (Gaussian) filter. The smoothing operation enhances the text relative to any background texture, because background texture normally has a higher frequency than text does. The smoothing operation also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold is automatically selected as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. In order to reliably identify the valley, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of sources: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. These images contain 21820 characters and 4406 words; 91% of the characters and 86% of the words are successfully cleaned up and binarized. A commercial OCR was applied to the binarized text when it consisted of fonts which were OCR recognizable. The recognition rate was 84% for the characters and 77% for the words.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsors.
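The two steps described in the abstract can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the filter widths (`sigma`, `hist_sigma`), the 8-bit intensity range, and the `min_peak_fraction` rule for ignoring very small histogram bumps are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def smooth_image(image, sigma=1.5):
    """Step 1: low-pass filter a grayscale image with a separable Gaussian."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    # Replicate border pixels so the output keeps the input shape.
    padded = np.pad(np.asarray(image, dtype=float), radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def valley_threshold(smoothed, hist_sigma=3.0, min_peak_fraction=0.05):
    """Step 2: find the valley between the first two histogram peaks.

    Assumes 8-bit intensities (0..255) and dark text on a lighter background.
    min_peak_fraction is an assumed heuristic, not taken from the paper.
    """
    levels = np.clip(np.rint(smoothed), 0, 255).astype(int)
    hist = np.bincount(levels.ravel(), minlength=256).astype(float)
    # Smooth the histogram so that the valley can be located reliably.
    hist = np.convolve(hist, gaussian_kernel(hist_sigma), mode="same")
    # A bin is a peak if it rises above its left neighbour and is not
    # exceeded by its right neighbour; the sentinels handle bins 0 and 255.
    padded = np.concatenate(([-1.0], hist, [-1.0]))
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] >= padded[2:])
    is_peak &= hist >= min_peak_fraction * hist.max()
    peaks = np.flatnonzero(is_peak)
    if len(peaks) < 2:
        raise ValueError("histogram does not have two distinct peaks")
    first, second = peaks[0], peaks[1]
    return int(first + np.argmin(hist[first:second + 1]))


def binarize(image, sigma=1.5):
    """Return (text_mask, threshold); text_mask is True where text is found."""
    smoothed = smooth_image(image, sigma)
    threshold = valley_threshold(smoothed)
    return smoothed < threshold, threshold
```

Calling `binarize(image)` on a 2-D grayscale array returns a boolean mask of candidate text pixels together with the selected threshold. For light text on a dark background, the image would presumably need to be inverted first, since the sketch follows the abstract's black-text case.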

Similar resources

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing stage by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

Ancient Document Images Enhancement Using Phase Based Binarization

In this paper, we present a phase-based binarization model for degraded document images, as well as a post-processing method that can improve any binarization method and a ground truth generation tool. Many binarization techniques have been implemented in the literature for different types of binarization problems. These include an adaptive image contrast based document image binarization technique t...

Document Image Binarization Using Threshold Segmentation

Binarization is the process of generating a binary image from a document image. Document image binarization has been under research for many years, and many binarization algorithms have been proposed for different types of degraded document images. Document image binarization is widely used to restore old handwritten and machine-printed documents. Still, recovering a degraded document is very tedi...

Foreground-Background Regions Guided Binarization of Camera-Captured Document Images

Binarization is an important preprocessing step in several document image processing tasks. Handheld camera devices are now in widespread use and allow fast and flexible document image capture. However, they may produce degraded grayscale images, especially due to bad shading or non-uniform illumination. State-of-the-art binarization techniques, which are designed for scanned images, do not...

A Survey on Degraded Document Image Binarization Techniques

Segmentation is the major technique used in image binarization to separate pixel values into two groups, black as foreground and white as background. Degraded document images are segmented using image binarization in order to obtain clear images that match the original documents. Thresholding process is t...

Publication date: 1998